SiSoftware Sandra Q & A - Memory Benchmark

This document provides some frequently asked questions about Sandra. Please read the Help File as well!

Q: What is STREAM?
A: STREAM is a popular memory bandwidth benchmark that has been used on everything from personal computers to supercomputers. It measures sustained memory bandwidth, not burst or peak bandwidth, so its results may be lower than those of other benchmarks. Sandra's memory benchmark is based on STREAM.
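
As a rough illustration of what a sustained-bandwidth measurement looks like, here is a minimal Python sketch of a STREAM-style "triad" kernel. It is not STREAM's or Sandra's actual code: the function name, array size and repeat count are assumptions made for this example, and an interpreted loop is CPU bound, so the figure it reports says nothing about real memory bandwidth.

```python
import time
from array import array


def triad_bandwidth_mb_s(n=200_000, scalar=3.0, repeats=3):
    """Time a STREAM-style triad: a[i] = b[i] + scalar * c[i]."""
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    # Each element is read from b, read from c and written to a.
    bytes_moved = 3 * n * a.itemsize
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for i in range(n):
            a[i] = b[i] + scalar * c[i]
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from a coarse timer.
    return a, bytes_moved / max(best, 1e-9) / 1e6


if __name__ == "__main__":
    _, mb_s = triad_bandwidth_mb_s()
    print(f"triad: {mb_s:.1f} MB/s (interpreter bound, illustrative only)")
```

The point of the sketch is the bookkeeping: the bytes moved are counted per pass over the arrays and divided by the time taken for the whole pass, which is why a sustained figure comes out lower than a burst or peak figure.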

Q: How is Sandra's Memory Benchmark different from STREAM?
A: STREAM 2.0 uses static data (about 12M) - Sandra uses dynamic data (around 40-60% of physical system RAM). This means that on computers with fast memory Sandra may yield lower results than STREAM. It's not feasible to make Sandra use static data: since Sandra is much more than a benchmark, a static allocation would needlessly tie up memory.

A major difference is that Sandra's algorithm is multi-threaded on SMP systems. This works by splitting the arrays and letting each thread work on its own part. Sandra creates a thread for each CPU in the system and assigns each thread to an individual CPU.
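
The array splitting described above can be sketched as follows. This is only an illustration in Python, not Sandra's implementation: `split_ranges` and `threaded_triad` are hypothetical names, the triad kernel is borrowed from STREAM, and the sketch does not pin threads to CPUs (Python threads also share one interpreter lock, so it shows the partitioning, not a speed-up).

```python
import threading


def split_ranges(length, workers):
    """Return (start, stop) pairs that cover range(length) exactly once."""
    base, extra = divmod(length, workers)
    ranges = []
    start = 0
    for w in range(workers):
        # The first `extra` workers take one additional element each,
        # so any worker count (not just powers of 2) is handled.
        stop = start + base + (1 if w < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def threaded_triad(a, b, c, scalar, workers):
    """Run a[i] = b[i] + scalar * c[i] with one thread per slice."""
    def work(start, stop):
        for i in range(start, stop):
            a[i] = b[i] + scalar * c[i]

    threads = [threading.Thread(target=work, args=r)
               for r in split_ranges(len(a), workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    print(split_ranges(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

Because every thread touches a disjoint slice, no locking is needed on the arrays themselves.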

Another difference is the aggressive use of scheduling/overlapping of instructions in order to maximise memory throughput even on "slower" processors. The loops should always be memory bound rather than CPU bound on all modern processors.

The other major difference is the use of alignment. Sandra dynamically changes the alignment of the streams until it finds the best combination, then tests that combination repeatedly to estimate the maximum throughput of the system. You can change the alignment in STREAM and recompile - but generally it is set to 0 (i.e. no offset).
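
The "search, then re-test" idea can be sketched like this. It is a simplified illustration, not Sandra's algorithm: `find_best_alignment` is a hypothetical name, `measure` stands for whatever routine times one run at a given alignment, and the scores in the usage example are made-up numbers.

```python
def find_best_alignment(measure, candidates, confirm_runs=3):
    """Pick the candidate with the highest score, then re-test it.

    measure      -- callable returning a throughput figure for a candidate
    candidates   -- non-empty list of alignments (or alignment combinations)
    confirm_runs -- how many times the winner is measured again
    """
    best = max(candidates, key=measure)
    # Repeat the winning configuration and keep the highest reading
    # as the estimate of maximum throughput.
    estimate = max(measure(best) for _ in range(confirm_runs))
    return best, estimate


if __name__ == "__main__":
    # Hypothetical scores in MB/s for three byte offsets.
    fake_scores = {0: 100.0, 16: 120.0, 32: 110.0}
    print(find_best_alignment(fake_scores.__getitem__, [0, 16, 32]))
```

With real timings each call to `measure` would return a slightly different value, which is why the winner is measured several times rather than once.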

Q: Is Sandra compatible with STREAM?
A: No. See above for the main differences. The relative difference between two computers should be similar under both benchmarks, but the scores themselves are not directly comparable.

Q: Why does the rating change between runs?
A: Make sure you have enough RAM (16MB or more) and that only Sandra is running. If you see the hard disk light come on, your computer is swapping. Accurate results can only be obtained if the computer is not swapping.

Q: What's the deal with the new block pre-fetch/buffering SSE(2)/EMMX benchmarks?
A: In a nutshell, the new tests use the pre-fetching instructions to bring data into the CPU and store the data directly into memory, bypassing the caches. In order to maximise throughput, buffers are used to pre-fetch data into the caches so that it is already there when needed and to reduce switches between different data streams.

For more information, please read the following whitepapers:

Intel Pentium III - SGI Whitepaper on memory copy using new SSE instructions.
Intel Pentium 4 - see Pentium 4 Code Optimisation Manual.
AMD Athlon/Duron - Athlon/Duron Optimisation Document. Relevant parts of the guide are Chapter 5, p. 66 Optimizing Main Memory Performance for Large Arrays and also the sample code in Chapter 10, p.180 which has Athlon/Duron-specific optimized memcpy() that works for any size memory block.

Q: Why do I see different reference score lists from Sandra on different systems?
A: There are currently 4 different reference lists, depending on the tests run (Options, CPU capabilities):

Uni-Processor legacy (ALU/FPU) tests
Multi-Processor (SMP) legacy (ALU/FPU) tests
Uni-Processor advanced (buffering/code-prefetch EMMX/SSE/SSE2) tests
Multi-Processor (SMP) advanced (buffering/code-prefetch EMMX/SSE/SSE2) tests

While this may appear confusing, it was done so that systems are compared using the same types of tests, avoiding apples-to-oranges comparisons as far as possible. This should ensure a fair test and result.

Q: How am I supposed to know what (kind of) test was run?
A: Pay attention to the result bar; it should tell you all about the test as well as the result in MB/s. It should say:

Type of unit(s) used. E.g. ALU or FPU.
Type of data used. E.g. integer or floating-point.
Any techniques used. E.g. block pre-fetch, buffering, etc.
Any instruction sets used. E.g. MMX, EMMX, SSE, SSE2, etc.
The score in MB/s.

Q: Why do some systems show close to 95% bandwidth efficiency and others less than 80%?
A: While the code for all CPUs is heavily optimised, the efficiency depends on chipset performance and memory settings. Only the most aggressive settings are likely to yield more than 80% efficiency, so treat anything higher as a bonus.
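
For reference, bandwidth efficiency is simply the measured figure divided by the theoretical peak of the memory bus. A minimal Python sketch follows; the function names are made up for this example and the 640MB/s score is a hypothetical reading, while the PC100 peak (100MHz, 64-bit, 800MB/s) is the figure quoted elsewhere in this document.

```python
def peak_bandwidth_mb_s(clock_mhz, bus_width_bits, transfers_per_clock=1):
    """Theoretical peak: clock x transfers per clock x bus width in bytes."""
    return clock_mhz * transfers_per_clock * bus_width_bits / 8


def efficiency(measured_mb_s, peak_mb_s):
    """Fraction of the theoretical peak that was actually achieved."""
    return measured_mb_s / peak_mb_s


if __name__ == "__main__":
    pc100_peak = peak_bandwidth_mb_s(100, 64)      # 800.0 MB/s
    print(f"{efficiency(640, pc100_peak):.0%}")    # hypothetical 640 MB/s score
```

DDR memory transfers data twice per clock, which is what the `transfers_per_clock` parameter is for.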

Q: Which memory has better efficiency: SDRAM, DDR, RDRAM, etc?
A: Mostly, memory bandwidth performance depends on chipset architecture and memory timings, thus performance varies. A generic answer is beyond the scope of this document.

Q: Why do I get lower indexes with the new buffering benchmarks & SSE(2)/EMMX in SMP mode?
A: While the ALU/FPU benchmarks were also memory bound, the CPU caches were used implicitly by the system; the new benchmarks completely bypass the caches to achieve greater throughput. However, this causes more collisions/congestion if the CPU bus is shared, which results in a lower index.

This is why L2 and L3 caches are so much more important in an SMP system. It is up to you to decide whether you want to measure the pure, maximum performance or the SMP performance. If you want to measure the former, disable SMP support in the benchmark's options.

Q: Why don't I get higher scores with HyperThreading/SMT enabled?
A: SMT does NOT help in memory transfers. The bandwidth available to each physical CPU is the same, thus running a thread on every logical CPU only increases overhead, resulting in lower scores. We're looking into using SMT for prefetching in future versions of the benchmark.

Q: I get lower scores in my HyperThreaded/SMT system than with a non-SMT CPU!
A: Please update to Sandra 2002 (8.59) or later for SMT support. Earlier versions have data alignment issues.

Q: If the benchmark is multi-threaded, why don't I get higher indexes on a SMP system?
A: The benchmark is OK. You can verify by looking at the load, number of threads and memory utilisation in Task Manager of Windows NT/2000/XP.

The issue is the bus that connects the CPUs. If it is shared rather than point-to-point (e.g. Intel's (A)GTL+ as used in PPro/PII/PIII/4), the CPUs share the same bandwidth, so you won't see much of an increase given the huge amount of data transferred by the benchmark. Since the benchmark is memory limited (as it must be to be correct), adding CPUs won't make much difference because the memory bus is the bottleneck. When the bus is lightly utilised you get close to an N-fold increase in performance (where N is the number of CPUs); otherwise you get little or no performance gain.
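
A very simple way to picture this is to cap the combined demand of the CPUs at the bandwidth of the shared bus. The Python sketch below is a toy model only: `smp_throughput` is a hypothetical name, the numbers are invented, and real systems lose additional bandwidth to contention.

```python
def smp_throughput(per_cpu_mb_s, cpus, shared_bus_mb_s):
    """Aggregate throughput when all CPUs share one bus (toy model)."""
    return min(per_cpu_mb_s * cpus, shared_bus_mb_s)


if __name__ == "__main__":
    bus = 1000  # hypothetical shared-bus limit in MB/s
    # Lightly utilised bus: two CPUs give close to twice the throughput.
    print(smp_throughput(300, 1, bus), smp_throughput(300, 2, bus))
    # One CPU nearly saturates the bus: the second adds almost nothing.
    print(smp_throughput(900, 1, bus), smp_throughput(900, 2, bus))
```

In the second case the index barely moves when a CPU is added, which is the behaviour described above for a memory-limited benchmark.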

Q: In my SMP system all memory benchmarks (ALU, FPU, MMX, SSE2 etc.) return the same score! Why is that?
A: See above. This shows that the benchmark is working, i.e. the limit of memory throughput has been reached, so it makes no difference which instructions are used to load/store the data.

Q: I get abnormally high scores on Windows 2000 on my Athlon system. What's up?
A: Please update to SP2 or later. This resolves a timer chip issue that affects some boards and fast CPUs.

Q: The benchmark crashes on my SMP system with an odd/non-power-of-2 number of CPUs (3, 5, 6, 7, etc.)!
A: Update to Sandra 2001se or later.

Q: My system is supposed to have a bandwidth of X MB/s (e.g. 800MB/s for PC100 SDRAM). Why does Sandra show less than 1/2 of it?
A: The number quoted by the manufacturer is the best-case sequential read throughput. Sandra both reads from and writes to memory, using different areas in SMP mode. This puts greater stress on the memory system (including cache controllers), resulting in a lower but more realistic index. Most programs read, compute and write back data rather than just read data. Please update to Sandra 2002 or later, which uses the new instruction sets and techniques to obtain better efficiency.

Q: Why isn't there a 3DNow! (Enhanced) version?
A: STREAM uses double values (64-bit), which neither 3DNow! nor Enhanced 3DNow! supports. There is no point in using floats (32-bit) just to create such a version; it would serve no purpose. There is, however, an EMMX version.

Q: Why is Sandra (2002 and later) memory index so high compared to other benchmarks?
A: This is due to using the latest instruction sets and techniques (see above) for obtaining the highest possible efficiency and thus performance out of the system. This should show what the system is able to do.

Q: Why is Sandra (2001 and earlier) memory index so low compared to other benchmarks?
A: Most other benchmarks just read or write to memory without performing any data manipulation. In real life, no program works this way - why read something if you don't actually use it? Sandra creates 3 large arrays and performs various simple arithmetic computations on them - thus both reading and writing memory. We feel this is a more objective test which measures the real throughput of the system. By using large blocks of memory (8MB+), the system cache(s) are swamped, thus the actual memory throughput is measured. Of course, the cache does have an effect - unless it is turned off.

If you compare Sandra's results with other benchmarks' (e.g. WinTune 98) large-block copy speed, you'll find the results are comparable.

Q: My non-Intel chipset (VIA, SiS, Ali) gets very low memory index.
A: Check that the memory settings are optimised. Most of these chipsets must be aggressively tweaked to match AMD/Intel ones. Do note that most of these chipsets are value rather than performance parts, and thus cannot be expected to match the AMD/Intel ones.

Q: My Intel 840 (2 RDRAM PC800 channels) gets nowhere near a 3.2GB/s index! Why?
A: The 840 is greatly FSB limited (1x 133MHz, 64-bit -> 1GB/s); furthermore, both processors share the FSB. With PIII Coppermine CPUs, single thread, Sandra 2002 gets close to 90% bandwidth efficiency which is correct. An 820 with a single RDRAM channel (1.6GB/s) gets very similar scores since it is FSB limited also. You could consider the 2nd channel of the 840 superfluous.
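
The reasoning above amounts to taking the slowest link between the CPU and the memory. Here is a small Python sketch using the rounded figures from this answer (1GB/s FSB, 1.6GB/s per RDRAM channel, about 90% efficiency); the function name is made up for this example and the result is an estimate, not a measured Sandra score.

```python
def effective_bandwidth_mb_s(fsb_mb_s, memory_mb_s, efficiency=1.0):
    """Achievable bandwidth is capped by the slower of FSB and memory."""
    return min(fsb_mb_s, memory_mb_s) * efficiency


if __name__ == "__main__":
    fsb = 1000                # 133MHz, 64-bit FSB, rounded to 1GB/s
    i840 = 2 * 1600           # two PC800 RDRAM channels
    i820 = 1 * 1600           # one PC800 RDRAM channel
    print(effective_bandwidth_mb_s(fsb, i840, 0.9))
    print(effective_bandwidth_mb_s(fsb, i820, 0.9))
```

Both chipsets come out the same because the FSB, not the memory, is the limit in each case.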

Q: My VIA Apollo266 (1 DDR PC2100 channel) gets nowhere near 2.1GB/s index. Why?
A: Just like the 840, the chipset is FSB limited, also both processors (in SMP mode) share the FSB.

Q: My Intel 845 (1 SDRAM PC133 channel) gets "low performance memory warning". What's wrong?
A: The maximum bandwidth of a single PC133 SDRAM @ 1GB/s is around 32% of the FSB bandwidth of the Pentium 4. The processor is starved for bandwidth which results in low performance. Both Athlon & Pentium 4 need DDR or RDRAM memory for good performance.

Q: My system gets a low score with USB/1394(FireWire) devices attached but great without! Why?
A: Do note that these are isochronous devices, thus some may take up system bandwidth even when not actively used. Find out by checking how much bandwidth each one takes in Device Manager; if it takes too much check the drivers or contact the manufacturer.

Q: My AMD Athlon gets a very low memory index.
A: Upgrade to Sandra 2001 or later.

Q: My AMD K6/K6-2 or Cyrix 6x86/MX/MII CPU gets a very low memory index.
A: These CPUs have large and fast L1 (internal) caches, but the L2 caches are on the mainboard and run at FSB speed, unlike the PII and Celeron, where the L2 cache runs at half and full CPU speed respectively. In most cases the caches also run in Write-Through mode, which slows down writes to memory appreciably when there are many such requests.

These processors seem to be less effective than Intel's designs at the same speed when accessing memory. A bottleneck limits the memory throughput to a certain level, making higher speed processors less effective. While our initial tests using floating point instructions seemed to point the finger at the non-pipelined FPU, our current tests using integer instructions return similar results.

Q: Why does the Win64 version of Sandra not test over 1TB of memory?
A: The current Win64 version of Sandra cannot handle more than 1TB of memory. Future versions will support more, although they may not support the whole 64-bit address space.

Q: Why does the Win32 version of Sandra not test over 2GB of memory?
A: The current Win32 version of Sandra cannot handle more than 2GB of memory. Future versions will support more, although they won't be able to support the 36-bit address space.

Q: Why does the WinCE 3.0 version of Sandra not test over 16MB of memory?
A: The current WinCE 3.0 version of Sandra cannot handle more than 16MB of memory. WinCE .Net versions can handle up to 2GB of memory.

Q: Why doesn't the benchmark include my super-duper XXXXGHz CPU?
A: While we do buy and test each and every CPU model on the market, we cannot afford to buy all the very latest speed grades of each CPU. Even if we could, we could not update the benchmark every time a new speed grade is released - we'd need to do it every week.